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In this review we discuss, from a unified point of view, a variety of Monte 
Carlo methods used to solve eigenvalue problems in statistical mechanics and 
quantum mechanics. Although the applications of these methods differ widely, 
the underlying mathematics is quite similar in that they are stochastic imple- 
mentations of the power method. In all cases, optimized trial states can be 
used to reduce the errors of Monte Carlo estimates. 
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I. INTRODUCTION 

Many important problems in computational physics and chemistry can be reduced to the 
computation of dominant eigenvalues of matrices of high or infinite order. We shall focus on 
just a few of the numerous examples of such matrices, namely, quantum mechanical Hamil- 
tonians, Markov matrices and transfer matrices. Quantum Hamiltonians, unlike the other 
two, probably can do without introduction. Markov matrices are used both in equilibrium 
and nonequilibrium statistical mechanics to describe dynamical phenomena. Transfer ma- 
trices were introduced by Kramers and Wannier in 1941 to study the two-dimensional Ising 
model,0 and ever since, important work on lattice models in classical statistical mechanics 
has been done with transfer matrices, producing both exact and numerical results.! 

The basic Monte Carlo methods reviewed in this chapter have been used in many different 
contexts and under many different names for many decades, but we emphasize the solution 
of eigenvalue problems by means of Monte Carlo methods and present the methods from 
a unified point of view. A vital ingredient in the methods discussed here is the use of 



optimized trial functions. Section [TV] deals with this topic briefly, but in general we suppose 
that optimized trial functions are given. We refer the reader to Ref. |3] for more details on 
their construction. 

The analogy of the time-evolution operator in quantum mechanics on the one hand, and 
the transfer matrix and the Markov matrix in statistical mechanics on the other, allows the 
two fields to share numerous techniques. Specifically, a transfer matrix G of a statistical 
mechanical lattice system in d dimensions often can be interpreted as the evolution operator 
in discrete, imaginary time t of a quantum mechanical analog in d — 1 dimensions. That 
is, G ~ exp(— tH), where 7i is the Hamiltonian of a system in d — 1 dimensions, the 
quantum mechanical analog of the statistical mechanical system. From this point of view, 
the computation of the partition function and of the ground-state energy are essentially the 
same problems: finding the largest eigenvalue of G and of exp(— fH), respectively. As far 
as the Markov matrix is concerned, this simply is the time-evolution operator of a system 
evolving according to stochastic dynamics. The largest eigenvalue of such matrices equals 
unity, as follows from conservation of probability, and for systems in thermal equilibrium, 
the corresponding eigenstate is also known, namely the Boltzmann distribution. Clearly, 
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the dominant eigenstate in this case poses no problem. For nonequilibrium systems, the 
stationary state is unknown and one might use the methods described in this chapter in 
dealing with them. Another problem is the computation of the relaxation time of a system 
with stochastic dynamics. This problem is equivalent to the computation of the second 
largest eigenvalue of the Markov matrix, and again the current methods apply. 

The emphasis of this chapter is on methods rather than applications, but the reader 
should have a general idea of the kind of problems for which these methods can be employed. 
Therefore, we start off by giving some specific examples of the physical systems one can deal 
with. 



A. Quantum Systems 

In the case of a quantum mechanical system, the problem in general is to compute expec- 
tation values, in particular the energy, of bosonic or fermionic ground or excited eigenstates. 
For systems with n electrons, the spatial coordinates are denoted by a 3n-dimensional vec- 
tor R. In terms of the vectors r; specifying the coordinates of electron number i this reads 
R = (ri, . . . , r n ). The dimensionless Hamiltonian is of the form 

<R|W|R'> = [-^V 2 + V(R)]5(R - R'). (1) 

For atoms or molecules atomic units are used fx — 1, and V is the usual Coulomb potential 
acting between the electrons and between the electrons and nuclei i.e., 

v(R) = E^-E^ (2) 

i<j ' l i a,i ' m 

where for arbitrary subscripts a and b we define r a b = \r a — r&| ; indices i and j label the 
electrons, and we assume that the nuclei are of infinite mass and that nucleus a has charge 
Z a and is located at position r a . 

In the case of quantum mechanical van der Waals clusters,S@ fx is the reduced mass - 
fx = 2imea/fi 2 in terms of the mass m, Planck's constant % and the conventional Lennard- 
Jones parameters e and o — and the potential is given by 

V(R) = e4-4 (3) 

i<j ij ij 

The quantum nature of the system increases with 1/fi 2 , which is proportional to the con- 
ventional de Boer parameter. 

The ground-state wavefunction of a bosonic system is positive everywhere, which is very 
convenient in a Monte Carlo context and allows one to obtain results with an accuracy that 
is limited only by practical considerations. For fermionic systems, the ground-state wave 
function has nodes, and this places more fundamental limits on the accuracy one can obtain 
with reasonable effort. In the methods discussed in this chapter, this bound on the accuracy 
takes the form of the so-called fixed-node approximation. Here one assumes that the nodal 
surface is given, and computes the ground-state wave function subject to this constraint. 
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The time-evolution operator exp(— tTC) in the position representation is the Green func- 
tion 

G(R',R,r) = (R'\e- TH \R). (4) 

For both bosonic systems and fermionic systems in the fixed-node approximation, G has 
only nonnegative elements. This is essential for the Monte Carlo methods discussed here. A 
problem specific to quantum mechanical systems is that G is only known asymptotically for 
short times, so that the finite-time Green function has to be constructed by the application 
of the generalized Trotter formulaifl, i.e., G(t) = linim^oo G(r/m) m , where the position 
variables of G have been suppressed. 



B. Transfer Matrices 

Our next example is the transfer matrix of statistical mechanics. The largest eigenvalue 
yields the free energy, from which all thermodynamic properties follow. As a typical transfer 
matrix, one can think of the one-site, Kramers- Wannier transfer matrix for a two-dimensional 
model of Ising spins, s, = ±1. Such a matrix takes a particularly simple form for a square 
lattice wrapped on a cylinder with helical boundary conditions with pitch one. This produces 
a mismatch of one lattice site for a path on the lattice around the cylinder. This geometry 
has the advantage that a two-dimensional lattice can be built one site at a time and that 
the process of adding each single site is identical each time. Suppose we choose a lattice of 
M sites, wrapped on a cylinder with a circumference of L lattice spaces. Imagine that we 
are adding sites so that the lattice grows toward the left. We can then define a conditional 
partition function Z M (S), which is a sum over those states (also referred to as configurations) 
for which the left-most edge of the lattice is in a given state S. The physical interpretation 
of Za/(S) is the relative probability of finding the left-most edge in a state S with which the 
rest of the lattice to its right is in thermal equilibrium. 

If one has helical boundary conditions and spins that interact only with their nearest 
neighbors, one can repeatedly add just a single site and the bonds connecting it to its 
neighbors above and to the right. Analogously, the transfer matrix G can be used to compute 
recursively the conditional partition function of a lattice with one additional site 

Z M+1 (S') = J2G(S'\S)Z M (S), (5) 
s 

with 

G(S'\S)=exp[K(s' 1 s 1 + s' 1 s L )]f[6 s>t _ 1 , (6) 

with S = (si, s 2 , • • • , sl) and S' = (s[, s' 2 , . . . , s' L ), and the Sj, s[ = ±1 are Ising spins. With 
this definition of the transfer matrix, the matrix multiplication in Eq. (|5]) accomplishes the 
following: (1) a new site, labeled 1, is appended to the lattice at the left edge; (2) the 
Boltzmann weight is updated to that of the lattice with increased size; (3) the old site L 
is thermalized; and finally (4) old sites 1, . . . , L — 1 are pushed down on the stack and are 
renamed to 2, . . . , L. Sites have to remain in the stack until all interacting sites have been 
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FIG. 1. Graphical representation of the transfer matrix. The primed variables are associated with the 
circles and combine into the left index of the matrix; the dots go with the right index and the unprimed 
variables. Coincidence of a circle and a dot produces a (5-function. An edge indicates a contribution to the 
Boltzmann weight. Repeated application of this matrix constructs a lattice with nearest neighbor bonds 
and helical boundary conditions. 

added, which determines the minimal depth of the stack. It is clear from Figure [I] that 
the transfer matrix is nonsymmetric and indeed symmetry is not required for the methods 
discussed in this chapter. It is of some interest that transfer matrices usually have the 
property that a right eigenvector can be transformed into a left eigenvector by a simple 
permutation and reweighting transformation. The details are not important here and let 
it suffice to mention that this follows from an obvious symmetry of the diagram shown 
in Figure % (1) rotate over 7r about an axis perpendicular to the paper, which permutes 
the states; and (2) move the vertical bond back to its original position, which amounts to 
reweighting by the Boltzmann weight of a single bond. 

Equation (|5]) implies that for large M and generic boundary conditions at the right-hand 
edge of the lattice, the partition function approaches a dominant right eigenvector x/jq of the 
transfer matrix G 

Z M (S) oc A A Vo(S), (7) 

where Ao is the dominant eigenvalue. Consequently, for M — > oo the free energy per site is 
given by 

f = -kT\n\ . (8) 

The problem relevant to this chapter is the computation of the eigenvalue Ao by Monte 
Carlo.! 



C. Markov Matrices 

Discrete-time Markov processes are a third type of problem we shall discuss. One of the 
challenges in this case is to compute the correlation time of such a process in the vicinity 
of a critical point, where the correlation time goes to infinity, a phenomenon called "critical 
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slowing down". Computationally, the problem amounts to the evaluation of the second 
largest eigenvalue of the Markov matrix, or more precisely its difference from unity. The 
latter goes to zero as the correlation time approaches infinity. 

The Markov matrix defines the stochastic evolution of the system in discrete time. That 
is, suppose that at time t the probability of finding the system in state S is given by pt(S). 
If the probability of making a transition from state S to state S' is P(S'|S) (sorry about the 
hat, we shall take it off soon!), then 

Pm(S / ) = EnS'|S) P ,(S) (9) 
s 

In the case of interest here, the Markov matrix P is constructed so that its stationary state 
is the Boltzmann distribution ?/> B = exp(— f37i). Sufficient conditions are that (a) each state 
can be reached from every state in a finite number of transitions and that (b) P satisfies 
detailed balance 

P(S'|S)^ b (S) 2 = P(S|S>b(S') 2 . (10) 
It immediately follows from detailed balance that the matrix P defined by 

ns'is) = ^L^p(s'|s)v»b(s). (ii) 

is symmetric and equivalent by similarity transformation to the original Markov matrix. 
Because of this symmetry, expressions tend to take a simpler form when P is used, as we 
shall do, but one should keep in mind that P itself is not a Markov matrix, since the sum 
on its left index does not yield unity identically. 

Again, to provide a specific example, we mention that the methods discussed below have 
been appliediS to an Ising model on a square lattice with the heat bath or Yang 3 transition 
probabilities and random site selection. In that case, time evolution takes place by single 
spin-flip transitions which occur between a given state S to one of the states S' that differ 
only at one randomly selected site. For any such pair of states, the transition probability is 
given by 

P(S'|S)=4 27Vcoshi/3AW f 3 (12) 

U-Es'VS^SlS) for S' = S, 
for a system of N sites with Hamiltonian 

-(m = KY,s i s j + K'Y,s i Si, (13) 

(m) (m)' 

where denotes nearest-neighbor pairs, and denotes next-nearest-neighbor pairs 

and AH = H(S') - H(S). 

We note that whereas the transfer matrix is used to deal with systems that are infi- 
nite in one direction, the systems used in the dynamical computations are of finite spatial 
dimensions only. 
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II. THE POWER METHOD 



Before discussing technical details of Monte Carlo methods to compute eigenvalues and 
expectation values, we introduce the mathematical ideas and the types of expressions for 
which statistical estimates are sought. We formulate the problem in terms of an operator 
G of which one wants to compute the dominant eigenstate and eigenvalue, |?/>o) and Ao- 
Mathematically, but not necessarily in a Monte Carlo setting, dominant may mean dominant 
relative to eigenstates of a given symmetry only. 

The methods to be discussed are variations of the power method, which relies on the fact 
that for a generic initial state of the appropriate symmetry, the states \u®) defined by 

\u^) = —G\u®) (14) 
c t +i 

converge to the dominant eigenstate |^ ) of G, if the constants c t are chosen so that \u®) 
assumes a standard form, in which case the constants c t converge to the dominant eigenvalue. 
This follows immediately by expanding the initial state I?/ -*) in eigenstates of G. One 
possible standard form is that, in some convenient representation, the component of \u®) 
largest in magnitude equals unity. 

For quantum mechanical systems, G usually is the imaginary-time evolution operator, 
exp(— tTI). As mentioned above, a technical problem in that case is that an explicit expres- 
sion is known only asymptotically for short times r. In practice, this asymptotic expression 
is used for a small but finite r and this leads to systematic, time-step errors. We shall deal 
with this problem below at length, but ignore it for the time being. 

The exponential operator exp(— tTL) is one of various alternatives that can be employed 
to compute the ground-state properties of the Hamiltonian. If the latter is bounded from 
above, one may be able to use II — tH, where r should be small enough that A = 1 — tE 
is the dominant eigenvalue of 11 — tH. In this case, there is no time-step error and the same 
holds for yet another method of inverting the spectrum of the Hamiltonian, viz. the Green 
function Monte Carlo method. There one uses (H — E) -1 , where E is a constant chosen 
so that the ground state becomes the dominant eigenstate of this operator. In a Monte 
Carlo context, matrix elements of the respective operators are proportional to transition 
probabilities and therefore have to be non-negative, which, if one uses either of the last two 
methods, may impose further restrictions on the values of r and E. 

For the statistical mechanical applications, the operators G are indeed evolution oper- 
ators by construction. The transfer matrix evolves the physical system in a spatial rather 
than time direction, but this spatial direction corresponds to time from the point of view 
of a Monte Carlo time series. With this in mind, we shall refer to the operator G as the 
evolution operator, or the Monte Carlo evolution operator, if it is necessary to distinguish 
it from the usual time-evolution operator exp(— tTC). 

Suppose that X is an operator of which one wants to compute an expectation value. 
Particularly simple to deal with are the cases in which the operators X and G are the same 
or commute. We introduce the following notation. Suppose that \u a ) and \up) are two 
states, then X^'^ denotes the matrix element 

_ (u a \G?'XGi>\up) 

x <* ~ • (15) 
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This definition is chosen to simplify the discussion, and generalization to physically relevant 
expressions, such as Eq. ^6|, is straightforward. 

Various Monte Carlo methods are designed to estimate particular instances of X^p' p \ 
and often the ultimate goal is to compute the expectation value in the dominant eigenstate 

x " ~ 1MW (16) 

which reduces to an expression for the dominant eigenvalue of interest if one chooses for 
X, in the applications discussed in the introduction, the Hamiltonian, transfer or Markov 
matrix. 

The simplest method is the variational Monte Carlo method, discussed in the next sec- 
tion. Here an approximate expectation value is computed by employing an approximate 
eigenvector of G. Typically, this is an optimized trial state, say |wt), m which case vari- 
ational Monte Carlo yields X^f\ which is simply the expectation value of X in the trial 
state. Clearly, variational Monte Carlo estimates of Xq have both systematic and statistical 
errors. 

The variational error can be removed asymptotically by projecting out the dominant 
eigenstate, i.e., by reducing the spectral weight of sub-dominant eigenstates by means of 
the power method. The simplest case is obtained if one applies the power method only to 



the right on the state \up) but not to the left on (u a \ in Eq. (15). Mathematically, this 
is the essence of diffusion and transfer matrix Monte Carlo, and in this way one obtains 
the desired result X Q if the operator X commutes with the Monte Carlo evolution operator 
G. In our notation, this means that X is given by the statistical estimate of X^°\ In 
principle, this yields an unbiased estimate of X , but in practice one has to choose p finite 
but large enough that the estimated systematic error is less than the statistical error. In 
some practical situations it can in fact be difficult to ascertain that this indeed is the case. 
If one is interested in the limiting behavior for infinite p or p', the state \u a ) or \up) need not 
be available in closed form. This freedom translates into the flexibility in algorithms design 
exploited in diffusion and transfer matrix Monte Carlo. 

If G and X do not commute, the mixed estimator X^^ is not the desired result, but the 
residual systematic error can be reduced by combining the variational and mixed estimates 
by means of the expression 

X = 2Xft oo) - Xft 0) + 0[(|lfo> - K» 2 ] (17) 

To remove the variational bias systematically, if G and X do not commute, the power 
method must be used to both the left and the right in Eq. (|T5T). Thus one obtains from 
X^'°°^ an exact estimate of X subject only to statistical errors. Of course, one has to pay 
the price of the implied double limit in terms of loss of statistical accuracy. In the context 
of the Monte Carlo algorithms discussed below, such as diffusion and transfer matrix Monte 
Carlo, this double projection technique to estimate xf^' 00 ^ is called forward walking or future 
walking. 

We end this section on the power method with a brief discussion of the computational 
complexity of using the Monte Carlo method for eigenvalue problems. In Monte Carlo 
computations one can distinguish operations of three levels of computational complexity, 
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depending on whether the operations have to do with single particles or lattice sites, the 
whole system, or state space summation or integration. The first typically involves a fixed 
number of elementary arithmetic operations, whereas this number clearly is at least pro- 
portional to the system size in the second case. Exhaustive state-space summation grows 
exponentially in the total system size, and for these problems Monte Carlo is often the only 
viable option. 

Next, the convergence of the power method itself comes into play. The number of iter- 
ations required to reach a certain given accuracy is proportional to log | Aq/Ai|, where the 
A and Ai are the eigenvalues of largest and second largest magnitude. If one is dealing 
with a single-site transfer matrix of a critical system, that spectral gap is proportional to 
L~ d for a system in d dimensions with a cross section of linear dimension L. In this case, a 
single matrix multiplication is of the complexity of a one-particle problem. In contrast, both 
for the Markov matrix defined above, and the quantum mechanical evolution operator, the 
matrix multiplication itself is of system-size complexity. Moreover, both of these operators 
have their own specific problems. The quantum evolution operator of G(t) has a gap on 
the order of r, which means that r should be chosen large for rapid convergence, but one 
does not obtain the correct results, because of the time-step error, unless r is small. Finally, 
the spectrum of the Markov matrix displays critical slowing down. This means that the gap 
of the single spin-flip matrix is on the order of L~ d ~ z , where z is typically a little bigger 
than twoB These convergence properties are well understood in terms of the mathematics 
of the power method. Not well understood, however, are problems that are specific to the 
Monte Carlo implementation of this method, which in some form or another introduces 
multiplicative fluctuating weights that are correlated with the quantities of interest .UMa 



III. SINGLE-THREAD MONTE CARLO 

In the previous section we have presented the mathematical expressions that can be 
evaluated with the Monte Carlo algorithms to be discussed next. The first algorithm is 
designed to compute an approximate statistical estimate of the matrix element X Q by means 
of the variational estimate X^f\ We writef] (S|«t) = («t|S) = «t(S) and for non- 
vanishing «t(S) define the configurational eigenvalue Xt(S) by 



X T (S)u T (S) = (S|A> T ) = ^(S|X|S')(S> T ) 

S' 



(18) 



This yields 



- ( o,o) EsMS) 2 X T (S) 



*We assume throughout that the states we are dealing with are represented by real numbers 
and that X is represented by a real, symmetric matrix. In many cases, generalization to complex 
numbers is trivial, but for some physical problems, while formal generalization may still be possible, 
the resulting Monte Carlo algorithms may be too noisy to be practical. 
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which shows that Xtt can be evaluated as a time average over a Monte Carlo time series of 
states Si, S2, • • • sampled from the probability distribution «t(S) 2 , i.e., a process in which 
Prob(S), the probability of finding a state S at any time is given by 

Prob(S) oc MS) 2 - (20) 

For such a process, the ensemble average in Eq. ( P^) can be written in the form of a time 
average 

X^ 0) = lira yEXrCSt). (21) 

For this to be of practical use, it has to be assumed that the configurational eigenvalue 
Xt(S) can be computed efficiently, which is the case if the sum over states S' in (S|X|ut) = 
X)s'(S|-X"|S')(S'|mt) can be performed explicitly. For discrete states this means that X 
should be represented by a sparse matrix; if the states S form a continuum, Xt(S) can 
be computed directly if X is diagonal or near- diagonal, i.e., involves no or only low-order 
derivatives in the representation used. The more complicated case of an operator X with 
arbitrarily nonvanishing off-diagonal elements will be discussed at the end of this section. 

An important special case, relevant for example to electronic structure calculations, is 
to choose for the operator X the Hamiltonian TC and for S the 3iV-dimensional real-space 
configuration of the system. Then, the quantity X-r is called the local energy, denoted by E-^. 
Clearly, in the ideal case that |«t) is an exact eigenvector of the evolution operator G, and 
if X commutes with G then the configurational eigenvalue Xt(S) is a constant independent 
of S and equals the true eigenvalue of X. In this case the variance of the Monte Carlo 
estimator in Eq. (|21|) goes to zero, which is an important zero-variance principle satisfied by 
variational Monte Carlo. The practical implication is that the efficiency of the Monte Carlo 
computation of the energy can be improved arbitrarily by improving the quality of the trial 
function. Of course, usually the time required for the computation of «t(S) increases as the 
approximation becomes more sophisticated. For the energy the optimal choice minimizes 
the product of variance and time; no such optimum exists for an operator that does not 
commute with G or if one makes the fixed node approximation, described in Section |V1 A2| , 
since in these cases the results have a systematic error that depends on the quality of the 
trial wavefunction. 

A. Metropolis Method 

A Monte Carlo process sampling the probability distribution «t(S) 2 is usually generated 
by means of the generalized Metropolis algorithm, as follows. Suppose a configuration S is 
given at time t of the Monte Carlo process. A new configuration S' at time t + 1 is generated 
by means of a stochastic process that consists of two steps: (1) an intermediate configuration 
S" is proposed with probability I1(S"|S); (2. a) S' = S" with probability p = A(S"\S), i.e., 
the proposed configuration is accepted; (2.b) S' = S with probability q = 1 — A(S"|S), i.e., 
the proposed configuration is rejected and the old configuration S is promoted to time t + 1. 
More explicitly, the Monte Carlo sample is generated by means of a Markov matrix P with 
elements P(S'|S) of the form 
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p/q/iox / A(S'|S)II(S'|S) for svs 

1 1 j \ 1 - £ S , VS A(S"|S) n(S"|S) for S' = S, 



(22) 



if the states are discrete and 



P(S'|S) = A(S'|S) n(S'|S) + 



i - J ds"A(s"\s) n(s"|s) 



6(S' - S) (23) 



if the states are continuous. Correspondingly, in Eq. (|22| ) II and P are probabilities while 
in Eq. (|23|) they are probability densities. 

The Markov matrix P is designed to satisfy detailed balance 

P(S'|S) MS) 2 = P(S|S') n T (S') 2 , (24) 

so that, if the process has a unique stationary distribution, this will be «t(S) 2 , as desired. 
In principle, one has great freedom in the choice of the proposal matrix II, but it is necessary 
to satisfy the requirement that transitions can be made between (almost) any pair of states 
with nonvanishing probability (density) in a finite number of steps. 

Once a proposal matrix II is selected, an acceptance matrix is defined so that detailed 
balance, Eq. (|24]), is satisfied 



A(S'|S) 



mm 



, n(s|SQ MS') 2 " 
' n(S'|S) MS) 2 ' 



For a given choice of II, infinitely many choices can be made for A that satisfy detailed 
balance but the preceding choice is the one with the largest acceptance. We note that if the 
preceding algorithm is used, then Xr(Sf) in the sum in Eq. (|2lD , can be replaced by the 
expectation value conditional on S" having been proposed: 

Xft 0) = hm j ^t(S0 + ftXr(St)] , (26) 

where p t = 1 — q t is the probability of accepting S". This has the advantage of reducing the 
statistical error somewhat, since now Xt(S") contributes to the average even for rejected 
moves, and will increase efficiency if Xt(S") is readily available. 

If the proposal matrix II is symmetric, as is the case if one samples from a distribution 
uniform over a cube centered at S, such as in the original Metropolis methodic, the factors 
of II in the numerator and denominator of Eq. fl25|) cancel. 

Finally we note that it is not necessary to sample the distribution u\ to compute Xtt: 
any distribution that has sufficient overlap with u\ will do. To make this point more 
explicitly, let us introduce the average of some stochastic variable X with respect to an 
arbitrary distribution p: 

{X) ' = EsP(S) (27) 

The following relation shows that the desired results can be obtained by reweighting, i.e., 
any distribution p will suffice as long as the ratio «t(S) 2 / p(S) does not fluctuate too wildly: 
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Xtt - - (u\/p) p ■ (28) 

This is particularly useful for calculation of the difference of expectation values with respect 
to two closely related distributions. An example of thisffi is the calculation of the energy 
of a molecule as a function of the inter-nuclear distance. 



B. Projector Monte Carlo and Importance Sampling 

The generalized Metropolis method is a very powerful way to sample an arbitrary given 
distribution, and it allows one to construct infinitely many Markov processes with the desired 
distribution as the stationary state. None of these, however, may be appropriate to design 
a Monte Carlo version of the power method to solve eigenvalue problems. In this case, the 
evolution operator G is given, possibly in approximate form, and its dominant eigenstate 
may not be known. To construct an appropriate Monte Carlo process, the first problem 
is that G itself is not a Markov matrix, i.e., it may violate one or both of the properties 
G(S'|S) > and J2s' G(S'\S) = 1. This problem can be solved if we can find a factorization 
of the evolution matrix G into a Markov matrix P and a weight matrix g with non-negative 
elements such that 

G(S'|S) = 0(S'|S)P(S'|S). (29) 

The weights g must be finite, and this almost always precludes use of the Metropolis method 
for continuous systems, as can be understood as follows. Since there is a finite probabil- 
ity that a proposed state will be rejected, the Markov matrix P(S'|S) will contain terms 
involving 5(S — S'), but generically, G will not have the same structure and will not allow 
the definition of finite weights g according to Eq. ( p8|) . However, the Metropolis algorithm 
can be incorporated as a component of an algorithm if an approximate stationary state is 
known and if further approximations are made, as in the diffusion Monte Carlo algorithm 



discussed in Section VI B 



As a comment on the side we note that violation of the condition that the weight g be 
positive results in the notorious sign problem in one of its guises, which is in most cases 
unavoidable in the treatment of fermionic or frustrated systems. Many ingenious attempts^ 
have been made to solve this problem, but this is still a topic of active research. However, 
as mentioned, we restrict ourselves in this chapter to the case of evolution operators G with 
nonnegative matrix elements only. 

We resume our discussion of the factorization given in Eq. (|29|) . Suppose for the sake of 
argument that the left eigenstate ipo of G is known and that its elements are positive, 

^o(S')G(S'|S) = A <MS). (30) 

S' 

If in addition, the matrix elements of G are nonnegative, the following matrix P is a Markov 
matrix 

P(S'|S) = ±MS')G(S'\S)J— (31) 
A o ^o(S) 
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Unless one is dealing with a Markov matrix from the outset, the left eigenvector of G is 
seldom known, but it is convenient, in any event, to perform a so-called importance sampling 
transformation on G. For this purpose we introduce a guiding function u g and define 

G(S'|S) = % (S')G(S'|S)^- (32) 

We shall return to the issue of the guiding function, but for the time being the reader can 
think of it either as an arbitrary, positive function, or as an approximation to the dominant 
eigenstate of G. From a mathematical point of view, anything that can be computed with 
the original Monte Carlo evolution operator G can also be computed with G, since the two 
represent the same abstract operator in a different basis. The representations differ only by 
normalization constants. All we have to do is to write all expressions derived above in terms 
of this new basis. 

We continue our discussion in terms of the transform G and replace Eq. fl29|) by the 
factorization 

G(S'|S) = £(S'|S)P(S'|S) (33) 

and we assume that P has the explicitly known distribution u g as its stationary state. The 
guiding function u g appears in those expressions, and it should be kept in mind that they 
can be reformulated by means of the reweighting procedure given in Eq. fl2~8|) to apply to 
processes with different explicitly known stationary states. On the other hand, one might be 
interested in the infinite projection limit p — ► oo. In that case, one might use a Monte Carlo 
process for which the stationary distribution is not known explicitly. Then, the expressions 
below should be rewritten so that the unknown distribution does not appear in expressions 
for the time averages. The function u g will still appear, but only as a transformation known 
in closed form and no longer as the stationary state of P. Clearly, a process for which the 
distribution is not known in closed form cannot be used to compute the matrix elements 
X^b for finite p and p' and given states \u a ) and \up). 

One possible choice for P that avoids the Metropolis method and produces finite weights 
is the following generalized heat bath transition matrix 

p = G(S'\S) 
E Bl G(Si|S) 

If £r(S'|S) is symmetric, this transition matrix has a known stationary distribution, viz., 
Crg(S)ttg(S), where Gg(S) = (S\G\u g ) / u g (S) , the configurational eigenvalue of G in state S. 
P must be chosen such that the corresponding transitions can be sampled directly. This is 
usually not feasible unless P is sparse or near-diagonal, or can be transformed into a form 
involving non- interacting degrees of freedom. We note that if P is defined by Eq. (|34]), the 
weight matrix g depends only on S. 



C. Matrix Elements 

We now address the issue of computing the matrix elements X^g' p \ assuming that the 
stationary state u\ is known explicitly and that the weight matrix g has finite elements. 
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We shall discuss the following increasingly complex possibilities: (a) [X, G] = and X is 
near-diagonal in the S representation; (b) X is diagonal in the S representation; (c) X(S|S') 
is truly off-diagonal. The fourth case, viz., [X, G] = and X is not near-diagonal is omitted 
since it can easily be constructed from the three cases discussed explicitly When discussing 
case (c) we shall introduce the concept of side walks and explain how these can be used to 
compute matrix elements of a more general nature than discussed up to that point. After 
deriving the expressions, we shall discuss the practical problems they give rise to, and ways 
to reduce the variance of the statistical estimators. Since this yields expressions in a more 
symmetric form, we introduce the transform 

X(S'\S) = u g (S')X(S'|S)^— . (35) 



1. [X, G] = and X Near-Diagonal 

In this case, X^ +p ^ = X^*'^ and it suffices to consider the computation of X^ p \ By 
repeated insertion in Eq. flT5| ) of the resolution of the identity in the S-basis, one obtains 
the expression 



2£(u,p) _ T,s p ,.. ., s UgjSp) X a (S p ) [l\ P =o G(S i+1 \Sj)} upjSp) _ 
£s p ,...,So u <*( s p) m£o G (Si+i|Si)] «^(S ) 



'(0,P) 



In the steady state, a series of subsequent states S t , S t+ i, . . . , S t+P occurs with probability 
Prob(S t , St+i, . . . , S t+ p) oc [ J| P(S t+ i+i|S t+i )] Ug(S t ) 2 . (37) 

i=0 

To relate products of the matrix P to those of G, it is convenient to introduce the following 
definitions 

p-i 

W t (p,q) = l[g(S t+i+1 \S t+i ) (38) 

i=q 

Also, we define 

MS) = (39) 
w g (S) 

where u can be any of a number of subscripts. 

With these definitions, combining Eqs. (|29|), (|36|), and (|3~7D , one finds 

x (o,p) = Um g|=i u a (S t+p ) X a (S t+p ) W t (p,0)up(S t ) ^ 



Et=i u a (S t+p )W t (p,0)u p (S t ) 
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2. Diagonal X 



The preceding discussion can be generalized straightforwardly to the case in which X 
is diagonal in the S representation. Again by repeated insertion of the resolution of the 
identity in the S-basis in the Eq. fll5| ) for , one obtains the identity 

ferp) = e v+p ,...,so M$ P ' +P ) [nt + /' lG ( s m|s,)]x(s p |s p ) [nr=oG(s m |s t )]^(s ) 

a " Ss p , +P ,...,s «a( W [nt + o P ^ G(S m |S,)] ^(S ) 

(41) 

Again by virtue of Eq. ([J7|), we find 

W t (p' + p,p)X(S t+p \S t+p ) W t (p,0)MSt) , An . 
x »p = J™ L : ^ rr^yr- — — ^ x • l 42 J 



Et=i u a (S t+p/+ p) W t (p' + p, 0)up(S t ) 



3. Nondiagonal X 

If the matrix elements of G vanish only if those of X do, the preceding method can be 
generalized immediately to the final case in which X is nondiagonal. Then, the analog of 
Eq. (H) is 



x (p',p) = lim S|=i u a (S t+p > +p ) W t (p' + p,p + l)g(S f +p + i|St+p) Wt(p,0)up(St) , 43 s 

^ " E[ = l «a(S t+p , +p ) W t (p' + P , 0)fy(S t ) 

where the x matrix elements are defined by 

I(S <|S) = = *<^> (44) 

X(b|b ' P(S'|S) P(S'|S)' 11 

Clearly, the preceding definition of x(S'|S) fails when P(S'|S) vanishes but X(S'|S) does not. 
If that can happen, a more complicated scheme can be employed in which one introduces 
side-walks. This is done by interrupting the continuing stochastic process at time t + p by 
introducing a finite series of auxiliary states SJ +p+1 , . . . , S' t+pl+p . The latter are generated 
by a separate stochastic process so that in equilibrium, the sequence of subsequent states 
S i; S i+ i, . . . , Sf +P , S't+p+n • • • > S Up'+p occurs with probability 

Prob[(S t , St+i, . . . , S t+P , S' t+p+1 , . . . , S' t+pl+p )] oc 

[nftrx 1 P(s' t+i+1 \s f t+i )} P x (s' t+p+1 \s t+p ) into p(s w+1 |s t+i )K(s t ) 2 (45) 

where Px is a Markov matrix chosen to replace P in Eq. so as to yield finite weights x. 
In this scheme, one generates a continuing thread identical to the usual Monte Carlo process 
in which each state S t is sampled from the stationary state of P, at least if one ignores the 
initial equilibration. Each state St of this backbone forms the beginning of a side walk, the 
first step of which is sampled from Px, while P again generates subsequent ones. Clearly, 
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with respect to the side walk, the first step disrupts the stationary state, so that the p' states 
S' t ,, which form the side walk, do not sample the stationary state of the original stochastic 
process generated by P, unless Px coincidentally has the same stationary state as P. 

A problem with the matrix elements we dealt with up to now is that in the limit p' or 
p —>■ oo all of them reduce to matrix elements involving the dominant eigenstate, although 
symmetries might be used to yield other eigenstates besides the absolute dominant one. 
However, if symmetries fail, one has to employ the equivalent of an orthogonalization scheme, 
such as, discussed in the next section, or one is forced to resort to evolution operators that 
contain, in exact or in approximate form, the corresponding projections. An example of this 
are matrix elements computed in the context of the fixed-node approximation!!!, discussed 



in Section [VI A | . Within the framework of this approximation, one considers quantities of 
the form 

x (p',p) = (u a \G'(XGl\u ) ^ 
{u a \GY\u a ){up\Gf\up) 

where the Gi are evolution operators combined with appropriate projectors, which in the 
fixed-node approximation are defined by the nodes of the states u a (S) and Mg(S). We shall 
describe how the preceding expression, Eq. (^), can be evaluated, but rather than writing 
out all the expressions explicitly, we present just the essence of the Monte Carlo method. 

To deal with these expressions, one generates a backbone time series of states sampled 
from any distribution, say, u g (S) 2 , that has considerable overlap with the the states |w a (S)| 
and |wg(S)|. Let us distinguish those backbone states by a superscript 0. Consider any such 
state S (0) at some given time. It forms the starting point of two side walks. We denote the 

(i) 

states of these side walks by SJ. where i — 1,2 identifies the side walk and tj labels the 
side steps. The side walks are generated from factorizations of the usual form, defined in 
Eq. (p3|), say Gi = QiPi. A walk 



5 = [SW,(sS 1) ,s( 1) ,...),(Sf ) ,S? ) ,...)] (47) 

occurs with probability 

Prob(S) = M g (S(0)) 2 A(S (0) |SS 1) ).--A(S (0) |SS 2) ).--- (48) 

We leave it to the reader to show that this probability suffices for the computation of all 
expressions appearing in numerator and denominator of Eq. (pE6|), in the case that X is 
diagonal, and to generate the appropriate generalizations to other cases. 

In the expressions derived above, the power method projections precipitate products of 
reweighting factors g, and, as the projection times p and p' increase, the variance of the 
Monte Carlo estimators grows at least exponentially in the square root of the projection 
time. Clearly, the presence of the fluctuating weights g is due to the fact that the evolution 
operator G is not Markovian in the sense that it fails to conserve probability. The importance 
sampling transformation Eq. fl32|) was introduced to mitigate this problem. In Section [V], an 
algorithm involving branching walks will be introduced, which is a different strategy devised 
to deal with this problem. In diffusion and transfer matrix Monte Carlo, both strategies, 
importance sampling and branching, are usually employed simultaneously. 
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D. Excited States 



Given a set of basis states, excited eigenstates can be computed variationally by solving 
a linear variational problem and the Metropolis method can be used to evaluate the required 
matrix elements. The methods involving the power method, as described above, can then 
be used to remove the variational bias systematically. 

In this context matrix elements appear in the solution of the following variational prob- 
lem. As was mentioned several times before, the price paid for reducing the variational bias 
is increased statistical noise, a problem which appears in this context with a vengeance. 
Again, the way to keep this problem under control is the use of optimized trial vectors. 

The variational problem to be solved is the following one. Given n basis functions \ui), 
find the n x n matrix of coefficients such that 

n 

fe^E^k) (49) 

8=1 

are the best variational approximations for the n lowest eigenstates of some Hamiltonian 
TC. In this problem we shall use the language of the quantum mechanical systems, where one 
has to distinguish the Hamiltonian from the evolution operator exp(— tTC). In the statistical 
mechanical applications, one has only the equivalent of the latter. In the expressions to be 
derived below the substitution 7iG p — > G p+1 will produce the expressions required for the 
statistical mechanical applications, at least if we assume that the nonsymmetric matrices 
that appear in that context have been symmetrized.^] 

One seeks a solution to the linear variational problem in Eq. (HSJ) in the sense that for 



all i the Rayleigh quotient {ipi^H^i) / {ijj^ipi) is stationary with respect to variation of the 
coefficients d. The solution is that the matrix of coefficients d has to satisfy the following 
generalized eigenvalue equation 



Y,H kl dT = E^N kl d!f\ (50) 

i=l i=l 

where 

H ki = (uk\H\ui), (51) 

and 

N ki = (u*h). (52) 

Before discussing Monte Carlo issues, we note a number of important properties of this 
scheme. First of all, the basis states |uj) in general are not orthonormal and this is reflected 
by the fact that the matrix elements of N have to be computed. Secondly, it is clear that any 



^This is not possible in general for transfer matrices of systems with helical boundary conditions, 
but the connection between left and right eigenvectors of the transfer matrix (see Section |IB| ) can 
be used to generalize the approach discussed here. 
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nonsingular linear combination of the basis vectors will produce precisely the same results, 
obtained from the correspondingly transformed version of Eq. (pOf). The final comment is 
that the variational eigenvalues bound the exact eigenvalues, i.e., Ei > E^. One recovers 
exact eigenvalues Ei and the corresponding eigenstates, if the \ui) span the same space as 
the exact eigenstates. 

The required matrix elements can be computed using the variational Monte Carlo method 
discussed in the previous section. Furthermore, the power method can be used to reduce 
the variational bias. Formally, one simply defines new basis states 

k (p) > = G p k) (53) 

and substitutes these new basis states for the original ones. The corresponding matrices 

= (ul P) \H\uf) (54) 

and 

^? = <4 P) k (p) > (55) 



can again be computed by applying the methods introduced in Section |T| for the computa- 
tion of general matrix elements by a Monte Carlo implementation of the power method. 

As an explicit example illustrating the nature of the Monte Carlo time-averages that 
one has to evaluate in this approach, we write down the expression for N-j as used for the 
computation of eigenvalues of the Markov matrix relevant to the problem of critical slowing 
down: 

^ )w E ?siy v (56) 

where the S t are configurations forming a time series which, as we recall, is designed to 
sample the distribution of a system in thermodynamic equilibrium, i.e., the Boltzmann 
distribution ip^. The expression given in Eq. (|56D yields the ti/V'B-auto-correlation function 
at lag p. The expression for H-?' is similar, and represents a cross-correlation function 
involving the configurational eigenvalues of the Markov matrix in the various basis states. 
Compared to the expressions derived in Section |H]j , Eq. fl56|) takes a particularly simple form 
in which products of fluctuating weights are absent, because in this particular problem one 
is dealing with a probability conserving evolution operator from the outset. 

Eq. ( [5T)D shows why this method is sometimes called correlation function Monte Carlo, 
but it also illustrates a new feature, namely, that it is efficient to compute all required 
matrix elements simultaneously. This can be done by generating a Monte Carlo process 
with a known distribution which has sufficient overlap with all |«i(S)| = |(S|it»)|. This can 
be arranged, for example, by sampling a guiding function i%(S) defined by 



uJS) 



\ 



I>"<(S) 2 , (57) 



1=1 



where the coefficients <2j approximately normalize the basis states \v,i), which may require 
a preliminary Monte Carlo run. See Ceperley and BernuEl for an alternative choice for a 
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guiding function. In the computations^ to obtain the spectrum of the Markov matrix in 
critical dynamics, as illustrated by Eq. fl56[) , the Boltzmann distribution, is used as a guiding 
function. It apparently has enough overlap with the decaying modes that no special purpose 
distribution has to be generated. 

E. How to Avoid Reweighting 

Before discussing the branching algorithms designed to deal more efficiently with the 
reweighting factors appearing in the expressions discused above, we briefly mention an al- 
ternative that has surfaced occasionally without being studied extensively, to our knowledge. 
The idea will be illustrated in the case of the computation of the matrix element X^ p \ and 



we take Eqn. fl36|) as our starting point. In statistical mechanical language, we introduce a 
reduced Hamiltonian 

p-i 

H = \nu g (S p ) + J2 In G(S t+1 1 S 4 ) + In u g (S ) (58) 

i=0 

and the corresponding Boltzmann distribution exp — TC(S P , . . . , So). One can now use the 
standard Metropolis algorithm to sample this distribution for this system consisting of p + 1 
layers bounded by the layers and p. For the evaluation of Eq. fl55p by Monte Carlo, 
this expression then straightforwardly becomes a ratio of correlation functions involving 
quantities defined at the boundaries. To see this, all one has to do is to divide the numerator 
and denominator of Eq. fl36|) by the partition function 

Z= J2 e- w < s "-' So > (59) 

S p ,...,So 

Note that in general, boundary terms involving some appropriately defined u g should be 
introduced to ensure the non-negativity of the distribution. For the simultaneous compu- 
tation of matrix elements for several values of the indices a and /3, a guiding function u g 
should be chosen that has considerable overlap with the corresponding \u a \ and \up\. 

The Metropolis algorithm can of course be used to sample any probability distribution, 
and the introduction of the previous Hamiltonian illustrates just one particular point of 
view. If one applies the preceding idea to the case of the imaginary-time quantum mechanical 
evolution operator, one obtains a modified version of the standard path-integral Monte Carlo 
method, in which case the layers are usually called time slices. Clearly, this method has the 
advantage of suppressing the fluctuating weights in estimators. However, the disadvantage 
is that sampling the full, layered system yields a longer correlation time than sampling the 
single-layer distribution u 2 g . This is a consequence of the fact that the microscopic degrees 
of freedom are more strongly correlated in a layered system than in a single layer. Our 
limited experience suggests that for small systems reweighting is more efficient, whereas the 
Metropolis approach tends to become more efficient as the system grows in size.0 

IV. TRIAL FUNCTION OPTIMIZATION 

In the previous section it was shown that eigenvalue estimates can be obtained as the 
eigenvalues of the matrix H^. The variational error in these estimates decreases 
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as p increases. In general, these expressions involve weight products of increasing length, 
and consequently the errors grow exponentially, but even in the simple case of a probability 
conserving evolution operator, errors grow exponentially. This is a consequence of the fact 
that the auto-correlation functions in N^ p \ and the cross-correlation functions in H^ p \ in 
the limit p — ► oo reduce to quantities that contain an exponentially vanishing amount of 
information about the subdominant or excited-state eigenvalues, since the spectral weight 
of all but the dominant eigenstate is reduced to zero by the power method. 

The practical implication is that this information has to be retrieved with sufficient 
accuracy for small values of p, before the signal disappears in the statistical noise. The 
projection time p can be kept small by using optimized basis states constructed to reduce 
the overlap of the linear space spanned by the basis states \ui) with the space spanned by 
the eigenstates beyond the first n of interest. We shall describe, mostly qualitatively, how 
this can be done by a generalization of a method used for optimization of individual basis 
stateJH~iI 5 viz. minimization of variance of the configurational eigenvalue, the local energy 
in quantum Hamiltonian problems. 

Suppose that ut(S,v) is the value of the trial function ut for configuration S and some 
choice of the parameters v to be optimized. As in Eq. dl8|), the configurational eigenvalue 
\(S, v) of configuration S is defined by 



where a prime is used to denote, for arbitrary |/), the components of G\f), or TC\f) as is 
more convenient for quantum mechanical applications. The optimal values of the variational 
parameters are obtained by minimization of the variance of A(S, v), estimated as an average 
over a small Monte Carlo sample. In the ideal case, i.e., if an exact eigenstate can be 
reproduced by some choice of the parameters of «t, the minimum of the variance yields the 
exact eigenstate not only if it were to be computed exactly, but even if it is approximated 
by summation over a Monte Carlo sample. A similar zero- variance principle holds for the 
method of simultaneous optimization of several trial states to be discussed next. This is 
in sharp contrast with the more traditional Rayleigh-Ritz extremization of the Rayleigh 
quotient, which frequently can produce arbitrarily poor results if minimized over a small 
sample of configurations. 

For conceptual simplicity, we first generalize the preceding method to the more general 
ideal case that reproduces the exact values of the desired n eigenstates of the evolution 
operator G. As a byproduct, our discussion will produce an alternative to the derivation of 
Eq. fl50p. To compute n eigenvalues, we have to optimize the n basis states \ui), where we 
have dropped the index "T" , and again we assume we have a sample of M configurations S a , 
a — 1, . . . , M sampled from u 2 g . The case we consider is ideal in the sense that we assume 
that these basis states \ui) span an n-dimensional invariant subspace of G. In that case, by 
definition there exists a matrix A of order n such that 



Again, the prime on the left-hand side of this equation indicates multiplication by G or 
by 7i = — t -1 lnG. If the number of configurations is large enough, A is for all practical 
purposes determined uniquely by the set of equations (|6l|) and one finds 




(60) 



n 




(61) 
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A = N^H, 



(62) 



where 

Nij = iE^i^(-S Q )^(^)K(^) 2 , 

Ha = ± Ea=i UiiS^iS^/u^f. (63) 

In the limit M — > oo this indeed reproduces the matrices N and if in Eq. fl50|). In the 
nonideal case, the space spanned by the n basis states |uj) is not an invariant subspace of 
the matrix G. In that case, even though Eq. ( pTj ) generically has no true solution, Eqs. flB2p 
and ( |63D still constitute a solution in the least-squares sense, as the reader is invited to show 
for himself by solving the normal equations. 

Next, let us consider the construction of a generalized optimization criterion. As men- 
tioned before, if a set of basis states span an invariant subspace, so does any nonsingular 
linear combination. In principle, the optimization criterion should have the same invariance. 
The matrix A lacks this property, but its spectrum is invariant. Another consideration is 
that, while the local eigenvalue is defined by a single configuration S, it takes at least n 
configurations to determine the "local" matrix A. This suggests that one subdivide the 
sample into subsamples of at least n configurations each and minimize the variance of the 
local spectrum over these subsamples. Again in principle, this has the advantage that the 
optimization can exploit the fact that linear combinations of the basis states have more 
variational freedom to represent the eigenstates than does each variational basis function 
separately. In practice, however, this advantage seems to be negated by the difficulty of find- 
ing good optimal parameters. This is a consequence of the fact that invariance under linear 
transformation usually can be mimicked by variation of the parameters of the basis states. 
In other words, a linear combination of basis states can be represented accurately, at least 
relative to the noise in the local spectrum, by a single basis state with appropriately chosen 
parameters. Consequently, intrinsic flaws of the trial states exceed what can be gained in 
reducing the variance of the local spectrum by exploiting the linear variational freedom, 
most of which is already used anyway in the linear variational problem that was discussed 
at the beginning of this section. This means that one has to contend with a near-singular 
nonlinear optimization problem. In practice, to avoid the concomitant slow convergence, 
it seems to be more efficient to break the "gauge symmetry" and select a preferred basis, 
which most naturally is done by requiring that each basis state itself is a good approximate 
eigenstate. 

The preceding considerations, of course, leave us with two criteria, viz. minimization 
of the variance of the local spectrum as a whole, and minimization of the variance of the 
configurational eigenvalue separately. To be of practical use, both criteria have to be com- 
bined, since if one were to proceed just by minimization of the variance of the configurational 
eigenvalues separately, one would simply keep reproducing the same eigenstate. In a non- 
Monte Carlo context this can be solved simply by some orthogonalization scheme, but as 
far as Monte Carlo is concerned, that is undesirable since it fails to yield a zero-variance 
optimization principle. 
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V. BRANCHING MONTE CARLO 



In Section |IIJ we discussed a method to compute Monte Carlo averages by exploiting the 
power method to reduce the spectral weight of undesirable, subdominant eigenstates. We 
saw that this leads to products of weights of subsequent configurations sampled by a Monte 
Carlo time series. To suppress completely the systematic errors due to finite projection 
times, i.e., the variational bias, one has to take averages of infinite products of weights. 
This limit would produce an "exact" method with infinite variance, which is of no practical 
use. 

We have also discussed how optimized trial states can be used to reduce the variance of 
this method. The variance reduction may come about in two ways. In the first place, by 
starting with optimized trial states of higher quality, the variational bias is smaller to begin 
with so that fewer power method projections are required. In practical terms, this leads to a 
reduction of the number of factors in the fluctuating products. Secondly, a good estimate of 
the dominant eigenstate, can be used to reduce the amount by which the evolution operator, 
divided by an appropriate constant, violates conservation of probability, which reduces the 
variance of the individual fluctuating weight factors. All these considerations also apply 
to the branching Monte Carlo algorithm discussed in this section, which can be modified 
accordingly and in complete analogy with our previous discussion. 

Before discussing the details of the branching algorithm, we mention that the algorithm 
presented here! contains the mathematical essence of both the diffusion and transfer matrix 
Monte Carlo algorithms. A related algorithm, viz., Green function Monte Carlo, adds yet 
another level of complexity due to the fact that the evolution operator is known only as an 
infinite series. This series is stochastically summed at each step of the power method itera- 
tions. In practice this implies that even the time step becomes stochastic and intermediate 
Monte Carlo configurations are generated that do not contribute to expectation values. Nei- 
ther Green function Monte Carlo, nor its generalization designed to compute quantities at 
non-zero temperature^, will be discussed in this chapter and we refer the interested reader 
to the literature for further details.il~§l 

Let us consider in detail the mechanism that produces large variance. This will allow 
us to explain what branching accomplishes if one has to compute products of many (ideally 
infinitely many) fluctuating weights. The time average over these products will typically be 
dominated by only very few large terms; the small terms are equally expensive to compute, 
but play no significant role in the average. This problem can be solved by performing many 
simultaneous Monte Carlo walks. One evolves a collection of walkers from one generation to 
the next and the key idea is to eliminate the light-weight walkers which produce relatively 
small contributions to the time average. To keep the number of walkers reasonably con- 
stant, heavy-weight walkers are duplicated and the clones are subsequently evolved (almost) 
independently. 

An algorithm designed according to this concept does not cut off the products over 
weights and therefore seems to correspond to infinite projection time. It would therefore 
seem that the time average over a stationary branching process corresponds to an average 
over the exact dominant eigenstate of the Monte Carlo evolution operator, but, as we shall 
see, this is rigorously the case only in the limit of an infinite number of walkerssc3; for any 
finite number of walkers, the stationary distribution has a bias inversely proportional to the 
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number of walkers, the so-called population control bias. If the fluctuations in the weights 
are small and correlations (discussed later) decay rapidly, this bias tends to be small. In 
many applications this appears to be the case and the corresponding bias is statistically 
insignificant. However, if these methods are applied to statistical mechanical systems at the 
critical point, significant bias can be introduced. We shall discuss a simple method of nearly 
vanishing computational cost to detect this bias and correct for it in all but the worst-cases 
scenarios. 

To discuss the branching Monte Carlo version of the power method, we continue to 
use the notation introduced above and again consider the Monte Carlo evolution operator 
G(S'|S). As above, the states S and S' will be treated here as discrete, but in practice the 
distinction between continuous and discrete states is a minor technicality, and generalization 
to the continuous case follows immediately by replacing sums by integrals and by replacing 
Kronecker <5's by Dirac 5 functions. 

To implement the power method iterations in Eq. (|I^) by a branching Monte Carlo 
process, \u^>) is represented by a collection of N t walkers, where a walker by definition 
is a state-weight pair (S Q , w a ), a = 1, . . . , N t . As usual, the state variable S Q represents a 
possible configuration of the system evolving according to G, and w a represents the statistical 
weight of walker a. These weights appear in averages and the efficiency of the branching 
Monte Carlo algorithm is realized by maintaining the weights in some range ui\ < w a < w n , 
where w\ and w u are bounds introduced so as to keep all weights w a of the same order of 
magnitude. 

The first idea is to interpret a collection of walkers that make up generation t as a 
representation of the (sparse) vector \u^) with components 

N t 

(S|««)^M (i) (S) = E^s,s Q , (64) 

a=l 

where S is the usual Kronecker 5-function. The underbar is used to indicate that the w^(S) 
represent a stochastic vector \u®). Of course, the same is true formally for the single 
thread Monte Carlo. The new feature is that one can think of the collective of walkers as a 
reasonably accurate representation of the stationary state at each time step, rather than in 
the long run. 

The second idea is to define a stochastic process in which the walkers evolve with tran- 
sition probabilities such that the expectation value of Q + i|tt^ +1 ^), as represented by the 
walkers of generation t + 1, equals G\yP^>) for any given collection of walkers representing 
\u®). It is tempting to conclude that, owing to this construction, the basic recursion re- 
lation of the power method, Eq. (|14|) , is satisfied in an average sense, but this conclusion 
is not quite correct. The reason is that in practice, the constants c t are defined on the fly. 
Consequently, c t +i and \uj- t+1 ') are correlated random variables and therefore there is no 
guarantee that the stationary state expectation value of \u®) is exactly an eigenstate of G, 
except in the limit of nonfluctuating normalization constants c t , which, as we shall see, is 
tantamount to an infinite number of walkers. More explicitly, the problem is that if one 



takes the time average of Eq. (14) and if the fluctuations of the Ct+i are correlated with 
\u^) or|M^ +1 ^) one does not produce the same state on the left- and right-hand sides of the 
time-averaged equation and therefore the time-averaged state need not satisfy the eigenvalue 
equation. The resulting bias has been discussed in the various contexts-BB^I 
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One way to define a stochastic process is to rewrite the power method iteration Eq. (|i"4"|) 

as 

„<*W)(S') = — £p(S'|S)$(S)ti«(S), (65) 
c t +i s 

where 

p(S) = ^G(S'IS) and P(S'|S) = G(S'|S)/#(S). (66) 

S' 

This is in fact what how transfer matrix Monte Carlo is defined. Referring the reader back 
to the discussion of Eq. , we note that in diffusion Monte Carlo the weight D is defined 
so that it is not just a function of the initial state S, but also depends on the final S'. 
The algorithm given below can trivially be generalized to accommodate this by making the 
substitution g(S) — > g(S'\S). 

Equation (]65f) describes a process represented by a Monte Carlo run which, after a few 
initial equilibration steps, consists of a time series of M updates of all walkers at times 
labeled by t = . . . , 0, 1, . . . , M . The update at time t consists of two steps designed to 
perform stochastically the matrix multiplications in Eq. (|65|). Following Nightingale and 
B16te,S the process is defined by the following steps. Let us consider one of these updates, 
the one that transforms the generation of walkers at time t into the generation at time t + 1. 
We denote variables pertaining to times t and t + 1 respectively by unprimed and primed 
symbols. 

1. Update the old walker (Sj,itfj) to yield a temporary walker (S'j,uQ according to the 
transition probability P(S'j|Sj), where w[ = g(Si)iUi/c', for i — 1, . . . ,N t . Step two, 
defined below, can change the number of walkers. To maintain their number close to 
a target number, say N , choose c' = A (A^/iVo) 1//s , where A is a running estimate of 
the eigenvalue A to be calculated, where s > 1 [see discussion after Eq. (|68D1. 

2. From the temporary walkers construct the new generation of walkers as follows 

(a) Split each walker (S', w') for which w' > b n into two walkers (S', \w'). The choice 
b u = 2 is a reasonable one. 

(b) Join pairs (S'j,wQ and (S'j,w'j) with w\ < b\ and w'j < b\ to produce a single 
walker (S'^,^ + w'j), where S'^ = S'j or S'^ = Sj with relative probabilities w[ 
and w'j. Here b\ — 1/2 is reasonable. 

(c) Walkers for which b\ < w ■ < b u , or left single in step [2b|, become members of the 
new generation of walkers. 

Note that, if the weights g(S) fluctuate on average more than by a factor of two, multiple 
split and join operations may be needed. 

It may help to explicate why wildly fluctuating weight adversely impact the efficiency 
of the algorithm. In that case, some walkers will have multiple descendents in the next 
generations, whereas others will have none. This leads to an inefficient algorithm since any 
generation will have several walkers that are either identical or are closely related, which 
will produce strongly correlated contributions to the statistical time averages. In its final 
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analysis, this is the same old problem that we encountered in a single-thread algorithm, 
where averages would be dominated by few terms with relatively large, explicitly given 
statistical weights. Branching mitigates this problem since walkers that are descendants of 
a given walker eventually decorrelate, but, as discussed in Sections |TTT| and |V]], the best cure 
is importance sampling and in practice both strategies are used simultaneously. 

The algorithm described above was constructed so that for any given realization of \u^), 
the expectation value of Ct_j_i |m ) ? i n accordance with Eq. QI3j), satisfies 



E 



c m |M (m) )l =G\u^), (67) 



where E(-) denotes the conditional average over the transitions defined by the preceding 
stochastic process. More generally, by p-fold iteration one findsE3 




G p \u {t) ). (68) 

The stationary state average of \u®) is close to the dominant eigenvector of G, but, 
as mentioned above, it has a systematic bias, proportional to 1/N t , when the number N t 
of walkers is finite. If, as is the case in some applications, this bias exceeds the statistical 
errors, one can again rely on the power method to reduce this bias by increasing p. If that 
is done, one is back to the old problem of having to average products of fluctuating weights, 
and, as usual, the variance of the corresponding estimators increases as their bias decreases. 
Fortunately, in practice the population control bias of the stationary state is quite small, if 
at all detectable, but even in those cases, expectation values involving several values of p 
should be computed to verify the absence of population control bias. The reader is referred 



to Refs. |2] JT2| , |31| -p3| for a more detailed discussion of this problem. Suffice it to mention here, 
first of all, that s, as defined in the first step of the branching algorithm given above is the 
expected number of time steps it takes to restore the number of walkers to its target value 
No and, secondly, that strong population control (s = 1) tends to introduce a stronger bias 
than weaker control (s > 1). 

With Eq. ( |5"8"D one constructs an estimatoril of the dominant eigenvector |tt(°°)) of the 
evolution operator G: 



\u {p) ) 




(69) 



For p = 0, in which case the product over b reduces to unity, this yields the stationary 
state of the branching Monte Carlo which frequently is treated as the dominant eigenstate 
of G. 

Clearly, this branching Monte Carlo algorithm can be used to compute the right-projected 
mixed estimators that were denoted by X^f in Section [TTJ. For this purpose one sets p' — 



in Eq. ( |I5|) and makes the substitution (u a \ = (ut\ and G p \up)\ = \u^'). Expressions for 
several values of p can be computed simultaneously and virtually for the price of one. We 
explicitly mention the important special case obtained by choosing for the general operator 
X the evolution operator G itself. This yields the following estimator for the dominant 
eigenvalue A of G: 



25 



Ao ~E«-» I (nL-^_ 4 )w<'-')' (70) 



where 



^w = x;t«f ) «T(s i ). (71) 

In diffusion Monte Carlo this estimator can be used to construct the growth estimate of the 
ground state energy. That is, since in that special case G ~ exp(— rH), eigenvalues of the 
evolution operator and the Hamiltonian are related by 

E = --ln\ . (72) 

T 



Besides expressions such as Eq. (|70|) one can construct expresssions with reduced vari- 
ance. These involve the configurational eigenvalue of G or "H in the same way this was done 
in our discussion of the single-thread algorithm. 

Again, in practical applications it is important to combine the raw branching algorithm 
defined above with importance sampling. Mathematically, this works in precisely the same 
way as in Section |TTTj in that one reformulates the same algorithm in terms of the similarity 
transform G with u g = «t chosen to be an accurate, approximate dominant eigenstate [see 
Eq. (|32|)1. In the single-thread algorithm, the result is that the fluctuations of the weights 
g and their products are reduced. In the context of the branching algorithm, this yields 
reduced fluctuations in the weight of walkers individually and in the size of the walker 
population. One result is that the population control bias is reduced. If we ignore this 
bias, a more fundamental difference is that the steady state of the branching algorithm is 
modified. That is, in the raw algorithm the walkers sample the dominant eigenstate of G, 
i.e., V'o(S), but, if the trial state \ur) is used for importance sampling, the distribution is 
Wt(S)^>o(S), which, of course, is simply the dominant eigenstate of G. 

So far, we have only discussed how mixed expectation values can be computed with 
the branching Monte Carlo algorithm and, as was mentioned before, this yields the desired 
result only if one deals with operators that commute with the evolution operator G. This 
algorithm can, however, also be used to perform power method projections to the left. In 
fact, most of the concepts discussed in Sections |III C lyTTl C 2j , and |IIIC~3| can be implemented 
straightforwardly. To illustrate this point we shall show how one can compute the left- and 
right-projected expectation value of a diagonal operator X. Since the branching algorithm 
is designed to explicitly perform the multiplication by G including all weights, all that is 
required is the following generalization!^, called forward or future walking. 

Rather than defining a walker to be the pair formed by a state and a weight, for for- 
ward walking we define the walker to be of the form [S, w, X(S-\ ),..., X(S_ p /)], where 
S_i,S_ 2 , ... are previous states of the walker. In other words, each walker is equipped 
with a finite stack of depth p' of previous values of the diagonal operator X. In going from 
one generation of walkers to the next, the state and weight of a walker are updated just as 
before to S' and w'. The only new feature is that the value X(S) is pushed on the stack: 
[S, w, X(S_i), X(S- P ')] — > [S', w r , X(S), X(S-i), X(S_p/+i)]. In this way, the p' times 
left-projected expectation value of X is obtained simply by replacing X(S) by X(S_ p /). 
Note that one saves the history of X rather than a history of configurations only for the 
purpose of efficient use of computer memory. 
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VI. DIFFUSION MONTE CARLO 



A. Simple Diffusion Monte Carlo Algorithm 

The diffusion Monte Carlo methodlH^ discussed in this section is an example of the 
general class of algorithms discussed in this chapter, all of which rely on stochastic imple- 
mentation of the power method to increase the relative spectral weight of the eigenstates of 
interest. In the quantum mechanical context of the current section, the latter is usually the 
ground state. In this general sense, there is nothing new in this section. However, a number 
of features enter that lead to technical challenges that cannot be dealt with in a general 
setting. There is the problem that the projector, the evolution operator G = exp(— tTC), 
is only known in the limit of small imaginary time r, and in addition, for applications to 
electronic structure problems, there are specific practical difficulties associated with nodes 
and cusps in the wave functions. Bosonic ground states are simpler in that they lack nodes, 
and we shall deal with those systems only in passing. As in the rest of this chapter, we 
deal only with the basic concepts. In particular, we focus on considerations relevant to the 
design of an efficient diffusion Monte Carlo algorithm. The reader is referred to Refs. 41-43 
for applications and other issues not covered here. 

For quantum mechanical problems, the power method iteration Eq. ( |14"|) takes the form 

\^ t + T )) =e -{H-E T )r (?3) 

Here Et is a shift in energy such that Eq — E^ ~ 0, where Eq is the ground state energy. 
In the real-space representation we have 

^(R',t + r) = JdR (RV^- Et)t |R) (R\ip(t)) = JdR G(R',R,r) ^(R,t), (74) 

In practical Monte Carlo settings, the shift E^ is computed on the fly and consequently 
is a slowly varying, nearly constant function of time, but for the moment we take it to be 
constant. The wavefunction ip(R, t) is the solution to the Schrodinger equation in imaginary 
time 

- ^VV(R, t) + [V(R) - £tMR, t) = _MM . (75) 



To make contact with Eq. fl29|) we should factor the Green function into a probability 
conserving part and a weight. For small times, this can be accomplished by the following 
approximation. If, on the left-hand side, we just had the first term, Eq. (|75|) would reduce 
to a diffusion equation, whence the method gets its name. For diffusion, the Green function 
is probability conserving and is given by 

- M (R.'-R-) 2 /2r 

P(R ' R ' T) - (w* - (76) 



If, on the other hand, we just had the second term then Eq. ( fT5| ) would reduce to a rate 
equation for which the Green function is S(R' — R) with prefactor 

g(R', R, r) = e ^ T -[v(R)+v(R')]/2}_ (77) 



27 



In combining these ingredients, we have to contend with the following general relation 
for noncommuting operators Tii 

As long as one is dealing with a 5-function, the weight in Eq. ([77]) is evaluated always at 
R ' = R, and therefore the expression on the right can be written also in non-symmetric form. 
However, Eq. (|78|) suggests that the exponent in Eq. (f77|) should be used in symmetric fashion 
as written. This is indeed the form we shall employ for the time being, but we note that 
the final version of the diffusion algorithm employs a nonsymmetric split of the exponent, 
since proposed moves are not always accepted. Since there are other sources of 0(r 2 ) 
contributions anyway this does not degrade the performance of the algorithm. Combination 
of the preceding ingredients yields the following approximate, short-time Green function for 
Eq. 

p -MR'-R) 2 /2t 

G(R',R,t) = e - 7 ^— J -—- 2 e r{ifr-rv(R')+v(R)]/2} + 0(t sj 



(2txt I \if n l'< 

e -^(R'-R) 2 /2r 



e r[E T -V m + G(r2) (79) 



For this problem, the power method can be implemented by Monte Carlo by means 
of both the single-thread scheme discussed in Section |TT| and the branching algorithm of 
Section [V|. We shall use the latter option in our discussion. In this case, one performs the 
following steps. Walkers of the zeroth generation are sampled from ^(R, 0) = ?/>t(R) using 
the Metropolis algorithm. The walkers are propagated forward an imaginary time r by 
sampling a new position R' from a multivariate Gaussian centered at the old position R and 
multiplying the weights of the walkers by the second factor in Eq. (|75|). Then the split/join 
step of the branching algorithm is performed to obtain the next generation of walkers, with 
weights in the targeted range. 



1. Diffusion Monte Carlo with Importance Sampling 

For many problems of interest, the potential energy V(R) exhibits large variations over 
coordinate space and in fact may diverge at particle coincidence points. As we have discussed 
in the general case, the fluctuations of weights g, produce noisy statistical estimates. As 
described in Sections |T| and |V|, this problem can be greatly mitigated by applying the sim- 
ilarity (or importance sampling) transformationil'0 to the evolution operator. Employing 
the general mathematical identity S'exp(— T7i)S~ l = exp^—rSHS' 1 ), this transformation 
can be applied conveniently to the Hamiltonian. That is, given a trial function ^t(R) one 
can introduce a distribution /(R, t) = , T (R)'0(R, t). If ip(R,t) satisfies the Schrodinger 
equation jEq. (|75|)1, it is a simple calculus exercise to show that / is a solution of the 
equation@0 

Vt(R)(w-£; t )^t(R) _ V(R,0 = 

-^V 2 /(R, t) + V • [V(R)/(R, t)} - 5(R)/(R, t) = (80) 
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Here the velocity V is a function (not an operator), often referred to in the literature as the 
quantum force, and is given by 

<m 

The coefficient of the source term, which is responsible for branching in the diffusion Monte 
Carlo context, is 

S(K) = E T -E L (K), (82) 
which is defined in terms of the local energy 

^MR)_ 1 VVrCR) , 

El(r) " ~ "2^^mrT ( } ' ( } 

the equivalent of the configurational eigenvalue introduced in Eq. fll8|) . 

Compared to the original Schrodinger equation, to which of course it reduces for the 
case tpT = 1) the second term in Eq. flS0| ) is new, and corresponds to drift. Again, one 
can explicitly write down the Green function of the equation with just a single term on the 
left-hand side. The drift Green function Gb of the equation obtained by suppressing all but 
this term is 

G D (R',R,r) =5[R'-R(r)] (84) 
where R(t) satisfies the differential equation 

§ = V(fi) (85) 

subject to the boundary condition R(0) = R. Again, at short times the noncommutativity 
of the various operators on the left-hand side of the equation can be ignored and thus one 
obtains the following short-time Green function. 

G(R',R,r)= fdR" <5[R"-R(r)l e r{Sr-[^(R')+^(a)]/a} + 

J L J {ZTIT/ fi) in ' 2 

p - M [R'-R(r)] 2 /2r 

= 6 --^^ e ^T-[£ L (R')+£L(R)]/2} + Q{r 2y (g6) 

(2tit/ fx) 6n ' 2 

Eq. (^) again can be viewed in our general framework by defining the probability con- 
serving generalization of Eq. fl7B|) 

-/4R'-R(t)] 2 /2t 

P(R ' R - r)= (2,^' (87) 

and the remainder of the Green function is <5[R' — R(r)] with prefactor 

0(R',R,t) = e ^T-[^(R')+s L (R)]/2 }j (g8) 
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which is the analog of Eq. ( ff7|) . 

When employed for the branching Monte Carlo algorithm, the factorization given in 
Eqs. (|8?D and (|88| ) differs from the original factorization Eqs. ([7!]) and flTT| ) in two respects: 
(a) the walkers do not only diffuse but also drift towards the important regions, i.e., in the 
direction in which |^t(R)| is increasing; and (b) the branching term is better behaved since 
it depends on the local energyrather than the potential. In particular, if the trial function 
iPt(R) obeys cusp conditions^ then the local energy at particle coincidences is finite even 
though the potential may be infinite. If one were so fortunate that is the exact 

ground state, the branching factor would reduce to a constant, which can be chosen to be 
unity by choosing E T to be the exact energy of that state. 

The expressions, as written, explicitly contain the expression R(t), which has to be 
obtained by integration of the velocity V(R). Since we are using a short-time expansion 
anyway, this exact expression may be replaced by the approximation 



In the improvements discussed below, designed to reduce time-step error, this expression 
is improved upon so that regions where V diverges do not make large contributions to the 
overall time-step error. 



The absolute ground state of a Hamiltonian with particle exchange symmetry is bosonic. 
Consequently, unless one starts with a fermionic, i.e., antisymmetric wavefunction and im- 
plements the power method in a way that respects this antisymmetry, the power method 
will reduce the spectral weight of the fermionic eigenstate to zero relative to the weight of 
the bosonic state. The branching algorithm described above assumes all weights are pos- 
itive and therefore is incompatible with the requirement of preserving antisymmetry. The 
algorithm needs modification, if we are interested in the fermionic ground state. 

If the nodes of the fermionic ground state were known, they could be imposed as boundary 
conditions^ and the problem could be dealt with by solving the Schrodinger equation within 
a single, connected region bounded by the nodal surface, a region we shall refer to as a nodal 
pocket. Since all the nodal pockets of the ground state of a fermionic system are equivalent, 
this would yield the exact solution of the problem everywhere. Unfortunately, the nodes 
of the wavefunction of an n-particle system form a (3n — l)-dimensional surface, which 
should not be confused with the nodes of single-particle orbitals. Of this full surface, in 
general, only the (3n — 3)-dimensional subset, corresponding to the coincidence of two like- 
spin particles, is known. Hence, we are forced to employ an approximate nodal surface as 
a boundary condition to be satisfied by the solution of the Schrodinger equation. This is 
called the fixed-node approximation. Usually, one chooses for this purpose the nodal surface 
of an optimized trial wave function^, and such nodes can at times yield a very accurate 
results if sufficient effort is invested in optimizing the trial wavefunction. 

Since the imposition of the boundary condition constrains the solution, it is clear that 
the fixed-node energy is an upper bound on the true fermionic ground state energy. In 
diffusion Monte Carlo applications, the fixed-node energy typically has an error which is 




(89) 



2. Fixed-Node Approximation 
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five to ten times smaller than the error of the variational energy corresponding to the same 
trial wavefunction, though this can vary greatly depending on the system and the trial 
wavefunction employed. 

For the Monte Carlo implementation of this approach one has to use an approximate 
Green function, which, as we have seen, may be obtained by iteration of a short time 
approximation. To guide the choice of an approximant accurate over a wide time range, it is 
useful to consider some general properties of the fixed-node approximation. Mathematically, 
the latter amounts to solution of the Schrodinger equation in a potential that equals the 
original physical potential inside the nodal pocket of choice, and is infinite outside. The 
corresponding eigenstates of the Hamiltonian are continuous and vanish outside the nodal 
pocket. Note that the solution can be analytically continued outside the initial nodal pocket 
only if the nodal surface is exact, otherwise there is a derivative discontinuity at the nodal 
surface. These properties are shared by the Green function consistent with the boundary 
conditions. This can be seen by writing down the spectral decomposition for the evolution 
operator in the position representation 

G(R',R,r) = (R'|e- rW |R) = £ ^(R^e^^B) (90) 

i 

where the ipi are the eigenstates satisfying the required boundary conditions. For notational 
convenience only, we have assumed that the spectrum has a discrete part only and that 
the wavefunctions can be chosen to be real. The Green function vanishes outside the nodal 
pocket and generically vanishes linearly at the nodal surface, as do the wavefunctions. 

The Green function of interest in practical applications is the one corresponding to 
importance sampling, the similarity transform of Eq. ( |90| ) 

G(R',R,r) = ^ T (R') (R'| C -^|R)_* (91) 

■0t(R) 

This Green function vanishes at the nodes quadratically in its first index, which, in the 
Monte Carlo context is the one that determines to which state a transition is made from a 
given initial state. 

The approximate Green functions of Eqs. ( [79] ) and ( |5E| ) have tails that extend beyond 
the nodal surface and, consequently, walkers sampled from these Green functions have a 
finite probability of attempting to cross the node. Since expectation values ought to be 
calculated in the r — > limit the relevant quantity to consider is what fraction of walkers 
attempt to cross the node per unit time in the r — > limit. If the Green function of Eq. ( |79|) 
is employed, this is a finite number, whereas, if the importance-sampled Green function of 
Eq. (|86|) is employed, no walkers cross the surface since the velocity V is directed away from 
the nodes and diverges at the nodes. In practice, of course, the calculations are performed 
for finite r, but the preceding observation leads to the conclusion that in the former case it 
is necessary to reduce to zero the weight of a walker that has crossed a node, i.e., to kill the 
walker, while in the latter case one can either kill the walkers or reject the proposed move, 
since in the r — > limit they yield the same result. 

We now argue that, for finite r, rejecting moves is the choice with the smaller time-step 
error when the importance-sampled Green function is employed. Sufficiently close to a node, 
the component of the velocity perpendicular to the node dominates all other terms in the 
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Green function and it is illuminating to consider a free particle in one dimension subject to 
the boundary condition that ip have a node at x = 0. The exact Green function for this 
problem is 

G(x', x, r) = 1 [ e -M-'-) 2 /2r _ e -Mx'+2) 2 /2T] ^ (g2) 
while the corresponding importance sampled Green function is 

6{x', X, T) = X ' [ e -n(*'-*r/2T _ e -K-'+x) 2 /2r]_ (93) 

xJ2ttt/ h 

We note that the integral of the former, over the region x > 0, is less than one and decreases 
with time. In terms of the usual language used for diffusion problems, this is because of 
absorption at the x = boundary. In our case, this provides the mathematical justification 
for killing the walkers that cross. On the other hand, the integral of the Green function of 
Eq. (^) equals one. Consequently, for finite r it seems likely that algorithms that reject 
moves across the node, such as the one discussed in Section |VIB| yield a better approximation 
than algorithms that kill the walkers that cross the node. 

As mentioned, it can be shown that all the nodal pockets of the ground state of a 
fermionic system are equivalent and trial wavefunctions are constructed to have the same 
property. Consequently, the Monte Carlo averages will not depend on the initial distribution 
of the walkers over the nodal pockets. The situation is more complicated for excited states, 
since different nodal pockets of excited-state wavefunctions are not necessarily equivalent, 
neither for bosons nor for fermions. Any initial state with walkers distributed randomly 
over nodal pockets pockets, will evolve to a steady state distribution with walkers only in 
the pocket with the lowest average local energy, at least if we ignore multiple-node crossings 
and assume a sufficiently large number of walkers, so that fluctuations in the average local 
energy can be ignored. 



3. Problems with Simple Diffusion Monte Carlo 

The diffusion Monte Carlo algorithm corresponding to Eq. (|86l) is in fact not viable for a 
wavefunction with nodes for the following two reasons. Firstly, in the vicinity of the nodes 
the local energy of the trial function diverges inversely proportional to the distance to 
the nodal surface. For nonzero r, there is a nonzero density of walkers at the nodes. Since 
the nodal surface for a system with n electrons is 3n — 1 dimensional, the variance of the 
local energy diverges for any finite r. In fact, the expectation value of the local energy 
also diverges, but only logarithmically. Secondly, the velocity of the electrons at the nodes 
diverges inversely as the distance to the nodal surface. The walkers that are close to a 
node at one time step, drift at the next time step to a distance inversely proportional to 
the distance from the node. This results in a charge distribution with a component that 
falls off as the inverse square of distance from the atom or molecule, whereas in reality the 
decay is exponential. These two problems are often remedied by introducing cut-offs in the 
values of the local energy and the velocity^lS, chosen such that they have no effect in the 
t — ► limit, so that the results extrapolated to r = are correct. In the next section better 
remedies are presented. 
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B. Improved Diffusion Monte Carlo Algorithm 



1. The limit of Perfect Importance Sampling 

In the limit of perfect importance sampling, that is if ^t(R) = if>o(l&), the energy shift 
Et can be chosen such that the branching term in Eq. ( jSOf) vanishes identically for all R. 
In this case, even though the energy can be obtained with zero variance, the steady state 
distribution of the diffusion Monte Carlo algorithm discussed above is only approximately 
the desired distribution ip^, because of the time-step error in the Green function. However, 
since one has an explicit expression, ip^, for the distribution to be sampled, it is possible to 
use the Metropolis algorithm, described in Section [111 A|, to sample the desired distribution 



exactly. Although the ideal ?/>t(R) = ^o(R-) is never achieved in practice, this observation 
leads one to devise an improved algorithm that can be used when moderately good trial 
wavef unctions are known. 

If for the moment we ignore the branching term in Eq. (|80D, then we have the equation 

__LvV + V.(V/) = -f. (94) 

This equation has a known steady-state solution / = V4 f° r an y V'Tj which in the limit of 
perfect importance sampling is the desired distribution. However, the approximate drift- 
diffusion Green function used in the Monte Carlo algorithm defined by Eq. (|86D without the 
branching factor, is not the exact Green function of Eq. fl9~4l) . Therefore, for any finite time 
step t, we do not obtain as a steady state, even in the ideal case. Following Reynolds et 
alxB, we can change the algorithm in such a way that it does sample in the ideal case, 
which also reduces the time-step error in nonideaLpractical situations. This is accomplished 
by using a generalizedliZHi! Metropolis algorithm@. The approximate drift- diffusion Green 
function is used to propose moves, which are then accepted with probability 

P = mm i — \ , ,„. > 1 = 1 - 9, (95) 



in accordance with the detailed balance condition. 

As was shown above, the true fixed-node Green function vanishes outside the nodal 
pocket of the trial wavefunction. However, since we are using an approximate Green function, 
moves across the nodes will be proposed for any finite r. To satisfy the boundary conditions 
of the fixed-node approximation these proposed moves are always rejected. 

If we stopped here, we would have an exact and efficient variational Monte Carlo algo- 
rithm to sample from Now, we reintroduce the branching term to convert the steady- 
state distribution from to ipTipo- This is accomplished by reweighting the walkers with 
the branching factor [see Eq. d§6|)1 



^ w — J ex P{|[*S'(-R' / ) + <S f (R-)] r cfT} for an accepted move, 
\ exp[S , (R)r e ff] for a rejected move, 

where S is defined in Eq. (|82D . An effective time step t c s, which will be defined presently, 
is required because the Metropolis algorithm introduces a finite probability of not moving 
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forward and rejecting the proposed configuration. Before defining r e g, we note that an 
alternative to expression fl96|) is obtained by replacing the two reweighting factors by a 
single expression, 



Aw = exp 



|(5(R') + 5(R)) + g5(R)|r cff 



for all moves (97) 



This expression, written down with Eq. fl26|) in mind, yields somewhat smaller fluctuations 
and time-step error than expression (fH). 

Following Reynolds et alsB, an effective time step r e s is introduced in Eq.( |97[) to account 
for the changed rate of diffusion. We set 

Teff = T 7Ai?T' (98) 

where the angular brackets denote the average over all attempted moves, and AR are the 
displacements resulting from diffusion. This equals (Ai? 2 ) accepted / (AR 2 ) but again has some- 
what smaller fluctuations. 

An estimate of r e fj is readily obtained iteratively from sets of equilibration runs. During 
the initial run, r e g is set equal to r. For the next runs, the value of T e g is obtained from 
the values of r e s computed with Eq. ([58|) during the previous equilibration run. In practice, 
this procedure converges in two iterations, which typically consume less than 2% of the total 
computation time. Since the statistical errors in r e fj affect the results obtained, the number 
of Monte Carlo steps performed during the equilibration phase needs to be sufficiently large 
that this is not a major component of the overall statistical error. 

The value of r c g is a measure of the rate at which the Monte Carlo process generates 
uncorrelated configurations, and thus a measure of the efficiency of the computation. Since 
the acceptance probability decreases when r increases, r e g has a maximum as a function of 
r. However, since the time-step error increases with r, it is advisable to use values of r that 
are smaller than this "optimum" . 

Algorithms that do not exactly simulate the equilibrium distribution of the drift-diffusion 
equation if the branching term is suppressed, i.e., algorithms that do not use the Metropolis 
accept/reject mechanism, can for sufficiently large r have time-step errors that make the 
energy estimates higher than the variational energy. On the other hand, if the drift-diffusion 
terms are treated exactly by including an accept/reject step, the energy, evaluated for any 
r, must lie below the variational energy, since the branching term enhances the weights of 
the low-energy walkers relative to that of the high-energy walkers. 



2. Persistent Configurations 

As mentioned above, the accept/reject step has the desirable feature of yielding the exact 
electron distribution in the limit that the trial function is the exact ground state. However, 
in practice the trial function is less than perfect and as a consequence the accept/reject 
procedure can lead to the occurrence of persistent configurations, as we now discuss. 

For a given configuration R, consider the quantity Q = (qAw), where q and Aw are the 
rejection probability and the branching factor given by Eqs. fl95| ) and fl9~7|). The average in 
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FIG. 2. Illustration of the persistent configuration catastrophe. The dotted horizontal line is 
the true fixed-node energy for a simple Be wavefunction extrapolated to r = 0. 

the definition of Q is over all possible moves for the configuration R under consideration. 
If the local energy at R is relatively low and if r e e is sufficiently large, Q may be in excess 
of one. In that case, the weight of the walker at R, or more precisely, the total weight of 
all walkers in that configuration will increase with time, except for fluctuations, until the 
time-dependent trial energy Et adjusts downward to stabilize the total population. This 
population contains on average a certain number of copies of the persistent configuration. 
Since persistent configurations must necessarily have an energy that is lower than the true 
fixed-node energy, this results in a negatively biased energy estimate. The persistent con- 
figuration may disappear because of fluctuations, but the more likely occurrence is that it 
is replaced by another configuration that is even more strongly persistent, i.e., one that has 
an even larger value of Q = (qAw). This process produces a cascade of configurations of 
ever decreasing energies. Both sorts of occurrences are demonstrated in Fig. |2|. Persistent 
configurations are most likely to occur near nodes, or near nuclei if tp^ does not obey the 
cusp conditions. Improvements to the approximate Green function in these regions, as dis- 
cussed in the next section, help to reduce greatly the probability of encountering persistent 
configurations to the point that they are never encountered even in long Monte Carlo runs. 

Despite the fact that the modifications described in the next section eliminate persistent 
configurations for the systems we have studied, it is clearly desirable to have an algorithm 
that cannot display this pathology even in principle. One possible method is to replace r e s 
in Eq. (|96|) by r for an accepted move and by zero for a rejected move. This ensures that 
Aw never exceeds unity for rejected moves, hence eliminating the possibility of persistent 
configurations. Further, this has the advantage that it is not necessary to determine r c g. 
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Other possible ways to eliminate the possibility of encountering persistent configurations are 
discussed in Ref. 150 . 



3. Singularities 



The number of iterations of Eq. (74) required for the power method to converge to the 
ground state grows inversely with the time step r. Thus, the statement made above, viz. 
that the Green function of Eq. ( |79|) is in error only to 0(t 2 ), would seem to imply that the 
errors in the electron distribution and the averages calculated from the short-time Green 
function are of 0(t). However, the presence of non-analyticities and divergences in the 
local energy and the velocity may invalidate this argument: the short-time Green function 
may lack uniform convergence in r over 3n-dimensional configuration space. Consequently, 
an approximation that is designed to be better behaved in the vicinity of singularities and 
therefore behaves more uniformly over space may outperform an approximation that is 
correct to a higher order in r for generic points in configuration space, but ignores these 
singularities. 

Next we discuss some important singularities that one may encounter and their impli- 
cations for the diffusion Monte Carlo algorithm. The behavior of the local energy and 
velocity V near nodes of ip^ are described in Table |. Although the true wave function ip 
has a constant local energy Eq at all points in configuration space, the local energy of "0T 
diverges at most points of the nodal surface of ip^ for almost all the i/jt that are used in prac- 
tice, and it diverges at particle coincidences (either electron-nucleus or electron-electron) for 
wave functions that fail to obey cusp conditions^. The velocity V diverges at the nodes and 
for the Coulomb potential has a discontinuity at particle coincidences both for approximate 
and for the true wavefunction. For the nodeless wavefunction of Lennard- Jones particles in 
their ground state,! V diverges as r -6 in the inter-particle distance r. Since the only other 
problem for these bosonic systems is the divergence of E^ at particle coincidences, we con- 
tinue our discussion for electronic systems and refer the interested reader to the literature! 
for details. 



TABLE I. Behavior of the local energy E^ and velocity v as a function of the distance R± of 
an electron to the nearest singularity. The behavior of various quantities is shown for an electron 
approaching a node or another particle, either a nucleus or an electron. The singularity in the local 
energy at particle overlap is only present for a that fails to satisfy the cusp conditions. 



Region 


Local energy 


Velocity 


Nodes 


£"l ~ ^TT^ f° r V'T 
E L = Eq for 


v ~ it 


Electron- 
nucleus / electron 


E^ ~ for some ■i/'T 
E L = E for 


v has a discontinuity 
for both tpT and ipo 



The divergences in the local energies cause large fluctuations in the population size: 
negative divergences lead to large local population densities and positive divergences lead to 
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Schematic comparing true and approximate Green functions 
tl 
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FIG. 3. Schematic comparing the qualitative behaviors of the true G with the approximate G of Eq. (|8C 
of an electron that is located at x = —0.2 at time t = and evolves in the presence of a nearby nucleus 
located at x = 0. The Green functions are plotted for three times: t\ < ti < t%. 



small ones. The divergence of V at the nodes typically leads to a proposed next state of a 
walker in a very unlikely region of configuration space and is therefore likely to be rejected. 
The three-dimensional velocity of an electron which is close to a nucleus is directed towards 
the nucleus. Hence the true Green function, for sufficiently long time, exhibits a peak at 
the nucleus, but the approximate Green function of Eq. fl86|) cause electrons to overshoot 
the nucleus. This is illustrated in Fig. ||], where we show the qualitative behavior of the 
true Green function and of the approximate Green function for an electron which starts at 
x = —0.2 in the presence of a nearby nucleus at x — 0. At short time ti, the approximate 
Green function of Eq. fl5U|) agrees very well with the true Green function. At a longer time 
t 2 , the true Green function begins to develop a peak at the nucleus which is absent in the 
approximate Green function, wheras at a yet longer time £3, the true Green function is 
peaked at the nucleus while the approximate Green function has overshot the nucleus. 

The combined effect of these nonanalyticities is a large time-step error, which can be 
of either sign in the energy, and large statistical uncertainty in the computed expectation 
values. We now give a brief description of how these nonanalyticities are treated. The diver- 
gence of the local energy at particle coincidences is cured simply by employing wavefunctions 
that obey cusp conditions^! The other nonanalyticities are addressed by employing a mod- 
ification of the Green function of Eq. fl36|) such that it incorporates the divergence of 
and V at nodes and the discontinuity in V at particle coincidences but smoothly reduces to 
Eq. (RBI) in the short-time limit or in the limit that the nearest nonanalyticity is far away. 



The details can be found in Ref. 50. The modified algorithm has a time-step error which 
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is two to three orders of magnitude smallerEl than the simple algorithm corresponding to 
Eq. ( jS6|) with cutoffs imposed on and V. 

We have used the application to all-electron electronic structure calculations to illustrate 
the sort of problems that can lead to large time-step errors and their solution. Other 
systems may exhibit only a subset of these problems or a modified version of them. For 
example, in calculations of bosonic clusters! there are no nodes to contend with, while, in 
electronic structure calculations employing pseudopotentialsl^H^, or pseudo-Hamiltoniansi! 
the potential need not diverge at electron-nucleus coincidences. We note, in passing, that 
the use of the latter methods has greatly extended the practical applicability of quantum 
Monte Carlo methods to relatively large systems of practical interestEl'H but at the price of 
introducing an additional approximation^. 



VII. CLOSING COMMENTS 

The material presented above was selected to describe from a unified point of view 
Monte Carlo algorithms as employed in seemingly unrelated areas in quantum and statistical 
mechanics. Details of applications were given only to explain general ideas or important 
technical problems, such as encountered in diffusion Monte Carlo. We ignored a whole 
body of literature, but we wish to just mention a few topics. Domain Green function 
Monte CarloiHil is one that comes very close to topics that were covered. In this method 
the Green function is sampled exactly by iterating upon an approximate Green function. 
Infinite iteration, which the Monte Carlo method does in principle, produces the exact 
Green function. Consequently, this method lacks a time-step error, and in this sense has the 
advantage of being exact. In practice, there are other reasons, besides the time-step error 
that force the algorithm to move slowly through state space, and currently the available 
algorithms seem to be less efficient than diffusion Monte Carlo, even when one accounts for 
the effort required to perform the extrapolation to vanishing time step. 

Another area that we just touched upon in passing is path integral Monte Carlo.ll Here 
we remind the reader that path integral Monte Carlo is a particularly appealing alternative 



for the evaluation of matrix elements such as such as in Eq. (15). The advantage of 



L a/3 

this method is that no products of weights appear, but the disadvantage is that it seems 
to be more difficult to move rapidly through state space. This is a consequence of the fact 
that branching algorithms propagate just a single time slice through state space, whereas 
path integral methods deal with a whole stack of slices, which for sampling purposes tends 
to produce a more rigid object. Finally, it should be mentioned that we ignored the vast 
literature on quantum lattice systems^. 

In splitting the evolution operator into weights and probabilities [see Eq. (|33|) 1 we as- 
sumed that the weights were non- negative. To satisfy this requirement, the fixed-node 
approximation was employed for fermionic systems. An approximation in the same vein is 
the fixed-phase approximation,!^ which allows one to deal with systems in which the wave- 
function is necessarily complex valued. The basic idea here is analogous to that underlying 
the fixed-node approximation. In the latter, a trial function is used to approximate the nodes 
while diffusion Monte Carlo recovers the magnitude of the wavefunction. In the fixed phase 
approximation, the trial function is responsible for the phase and Monte Carlo produces the 
magnitude of the wavefunction. 
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